In [1]:
import pandas as pd
import numpy as np
from io import StringIO
import numpy.linalg as la
import matplotlib.pyplot as plt
from matplotlib import cm as cm
import seaborn as sns
sns.set(font_scale=2)
plt.style.use('seaborn-whitegrid')
%matplotlib inline

Introduction

A tumor is a mass of abnormal tissue. Malignant and benign tumors have different cell growth characteristics.

For this activity, you will use python libraries such as seaborn and scikit-learn to:

  • explore the data using vizualization tools
  • run PCA to reduce the dimension of the dataset
  • split your data into training and test sets
  • create a model to predict wheter a tumor is malignant (cancerous / deadly) or benign (non-cancerous / safe) based on the tumor properties.

Load the dataset

In [2]:
params = ["radius", "texture", "perimeter", "area",
          "smoothness", "compactness", "concavity",
          "concave points", "symmetry", "fractal dimension"];
stats = ["(mean)", "(stderr)", "(worst)"]
labels = ["patient ID", "Malignant/Benign"]

for p in params:
    for s in stats:
        labels.append(p + " " + s)

tumor_data = pd.io.parsers.read_csv("breast-cancer-dataset.dat",header=None,names=labels)

feature_labels = labels[2:]

How many features in this dataset? How many patients?

You can inspect the first few lines of your data using tumor_data.head()

Define X as the dataset with the features for all patients (not including the patient ID and diagnosis):

Inspect the mean value for each feature using X.mean()

Inspect the standard deviation for each feature using X.std()

Define Y, an array with the diagnosis for each patient. We will later use a classification algorithm to help predict patient's diagnosis (based on the features, should a patient tumor be classified as a M or B?)

For visualization purposes only, you can select a subset of the features, and use

seaborn.pairplot(data)

to plot pairwise relationships in the dataset (you could plot all of the feature pairs, but this would generate a lot of plots!)

In [9]:
mean_label = [labels[1]] + labels[2::3] 
#sns.pairplot(tumor_data[mean_label], hue="Malignant/Benign", plot_kws = {'alpha': 0.6, 's': 80, 'edgecolor': 'k'})

Or we can look at the plot of the correlation matrix for all the featues:

In [10]:
corr_matrix = X.corr()
# plot correlation matrix
fig = plt.figure(figsize = (14,8))
ax1 = fig.add_subplot(111)
cax = ax1.imshow(corr_matrix, cmap=cm.get_cmap('jet'))
plt.title('Tumor features correlation matrix')
plt.grid(False)
ax1.set_xticks(np.arange(len(feature_labels)))
ax1.set_yticks(np.arange(len(feature_labels)))
ax1.set_xticklabels(feature_labels,fontsize=10,rotation=90)
ax1.set_yticklabels(feature_labels,fontsize=10)
fig.colorbar(cax)
Out[10]:
<matplotlib.colorbar.Colorbar at 0x10f1e6d68>

Overwhelming? Don’t worry, in some cases the effective dimension of your problem might be much smaller than the number of the features, such as in data sets where some features are irrelevant. Is that the case here?

Splitting the dataset

To assess the model’s performance later, we divide the dataset into two parts: a training set and a test set. The first is used to train the system, while the second is used to evaluate the learned or trained system.

We are going to use sklearn.model_selection.train_test_split to split the dataset

In [11]:
from sklearn.model_selection import train_test_split

A common splitting choice is to take 2/3 of your original data set as the training set, while the 1/3 that remains will compose the test set. You should select this proportion by assigning the variable s and setting the argument test_sizes = s

We will fix the seed for the random number generator, in order to get reproducible results

Use: X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=s, random_state=seed)

to split the dataset into the set of features and diagnosis used for training (X_train,Y_train) and a set used for evaluating the trained system (X_test,Y_test)

PCA

When you use algorithms for classification tasks, not all features will contribute well to the generalization capacity of your model. Some irrelevant and correlated attributes can even decrease the performance of some algorithms, contributing to overfitting, for example.

It is now your responsibility to choose the best set of features that will make your models performs better! Use PCA to reduce the feature space of the training set while still retaining the most meaningful information about the features.

1) Scaling the training set

We will now center the training data by making it have zero mean and unit standard deviation. We will first do this step explicility, like we have done in previous activities, before calling SVD to compute the variances. Let's revise how we were solving PCA before:

In [15]:
Xzero = X_train-X_train.mean()
Xscaled = Xzero/X_train.std()
U,sigma,Vt = la.svd(Xzero,full_matrices=False)
variance=sigma**2
total_variance = sum(variance)
print(variance/total_variance)
[9.80235252e-01 1.76388461e-02 1.90610748e-03 1.20447599e-04
 8.72383033e-05 6.84948034e-06 3.78245320e-06 8.65500617e-07
 3.55017700e-07 1.58090830e-07 6.51888543e-08 1.49593710e-08
 6.92494738e-09 4.50523385e-09 3.03706337e-09 1.27544572e-09
 8.53783527e-10 4.87077530e-10 4.12137475e-10 3.36099268e-10
 1.68075533e-10 1.30882995e-10 8.10116499e-11 5.40419239e-11
 3.54896936e-11 2.75410670e-11 7.61430739e-12 6.24731176e-12
 4.98193116e-12 1.63035465e-12]

From now on, we will use the PCA class from sklearn:

In [16]:
from sklearn.decomposition import PCA

The PCA class scales the data to have zero mean. We can also use the method

pca.explained_variance_ratio_

to obtain the explained variance.

In [17]:
pca = PCA().fit(X_train)
print(pca.explained_variance_ratio_) 
[9.80235252e-01 1.76388461e-02 1.90610748e-03 1.20447599e-04
 8.72383033e-05 6.84948034e-06 3.78245320e-06 8.65500617e-07
 3.55017700e-07 1.58090830e-07 6.51888543e-08 1.49593710e-08
 6.92494738e-09 4.50523385e-09 3.03706337e-09 1.27544572e-09
 8.53783527e-10 4.87077530e-10 4.12137475e-10 3.36099268e-10
 1.68075533e-10 1.30882995e-10 8.10116499e-11 5.40419239e-11
 3.54896936e-11 2.75410670e-11 7.61430739e-12 6.24731176e-12
 4.98193116e-12 1.63035465e-12]

Note that we obtained the same explained variances when using SVD in 'Xzero'.

In this tutorial, we will use the library StandardScaler to scale the training set for us.

In [18]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(X_train)

sklearn's transform's fit() just calculates the parameters (e.g. 𝜇 and 𝜎 in case of StandardScaler) and saves them as an internal objects state. Afterwards, you can call its transform() method to apply the transformation to a particular set of examples.

Xs_train = scaler.transform(X_train)

We will later apply the same transformation to the testing set, using the same two parameters 𝜇 and 𝜎 (values) that you used for centering the training set.

2) Fitting the PCA algorithm with the rescaled training set:

In summary, all we have to do now is:

In [20]:
pca = PCA().fit(Xs_train)
var_exp = pca.explained_variance_ratio_
print(var_exp)
[4.30255335e-01 1.96996395e-01 1.00129588e-01 6.58115680e-02
 5.43017590e-02 4.00702846e-02 2.22959251e-02 1.66088475e-02
 1.30348731e-02 1.15705377e-02 1.00644698e-02 9.06789620e-03
 8.28608754e-03 5.36584258e-03 2.99146910e-03 2.45522098e-03
 2.12900097e-03 1.72582342e-03 1.56893905e-03 1.04893989e-03
 9.49736547e-04 8.85518251e-04 7.65335345e-04 5.56751131e-04
 4.83118271e-04 2.65795914e-04 2.30387337e-04 5.37173237e-05
 2.67907073e-05 4.04699316e-06]

You can check above that you would obtain the same explained variances if you have used Xscaled to obtain the SVD

3) Deciding the number of principal components to use for the classification model

Plot the explained variance and the cummulative explained variance. If you want to retain the principal components that capture 80% of the variance, how many components should you choose?

In [21]:
cum_var_exp = np.cumsum(var_exp)
print(cum_var_exp)

plt.bar(range(len(var_exp)), var_exp, alpha=0.5, align='center',
            label='individual explained variance')
plt.step(range(len(var_exp)), cum_var_exp, where='mid',
             label='cumulative explained variance')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal components')
plt.legend(bbox_to_anchor=(1.1, 1.05))
plt.tight_layout()
[0.43025533 0.62725173 0.72738132 0.79319289 0.84749464 0.88756493
 0.90986085 0.9264697  0.93950457 0.95107511 0.96113958 0.97020748
 0.97849357 0.98385941 0.98685088 0.9893061  0.9914351  0.99316092
 0.99472986 0.9957788  0.99672854 0.99761406 0.99837939 0.99893614
 0.99941926 0.99968506 0.99991544 0.99996916 0.99999595 1.        ]
/anaconda3/lib/python3.6/site-packages/matplotlib/tight_layout.py:176: UserWarning: Tight layout not applied. The left and right margins cannot be made large enough to accommodate all axes decorations. 
  warnings.warn('Tight layout not applied. The left and right margins '

4) Creating the new (reduced) dataset for the classification model

Define the variable Z_train the projected training set with reduced feature space that captures 90% of the variance (remember Z_train = X_train V where V is the right singular vector matrix?)

How many components should you consider?

Two different approaches...

You can now project your test set onto the Principal Components of the training set.

In [24]:
Z_train = pca.transform(Xs_train)
Z_test = pca.transform(Xs_test)
In [25]:
# set colors for each data point
C = []
classes = list(set(Y_train))
for c in Y_train:
    if c == classes[0]:
        C.append('r')
    else:
        C.append('b')

plt.figure()
plt.title("Training Data Projected into 2D Subspace")
plt.xlabel("1st Component")
plt.ylabel("2nd Component")

plt.scatter(Z_train[:, 0], Z_train[:, 1], c=C)
plt.show()

Logistic regression

We will first use a logistic regression model. This is used for classification tasks where data points can only be a member of one class. The model can be solved either using a modified version of least squares or newton's method.

In [26]:
from sklearn.linear_model import LogisticRegression

Make an instance of the model, using all the default parameters for now.

In [27]:
model = LogisticRegression(solver="lbfgs")
In [28]:
#Recall...
print(Z_train.shape,Xs_train.shape)
(392, 7) (392, 30)

Training the model on the data

In [29]:
model.fit(Z_train,Y_train)
Out[29]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='lbfgs', tol=0.0001,
          verbose=0, warm_start=False)

Once you have the trained model, you can get the predicted diagnosis for the patients in the set X_test, and compare them with the actual diagnosis in Y_test

In [30]:
Ypredict = model.predict(Z_test)
In [31]:
# Accuracy: Overall, how often is the classifier correct?
(Y_test==Ypredict).mean()
Out[31]:
0.9880952380952381
In [32]:
test_score = model.score(Z_test, Y_test)
print(test_score)
0.9880952380952381
In [33]:
from sklearn.metrics import roc_curve, roc_auc_score, auc, f1_score, confusion_matrix

Confusion matrix: describes the performance of the classification model on a set of test data for which the true values are known.

In [34]:
# use "labels" argument to re-organize the entries in the confusion matrix
# (we typically treat malignant as positive and benign as negative)
cmat = confusion_matrix(Y_test,Ypredict, labels=["M", "B"])
print(cmat)
TP, FP, FN, TN = cmat.ravel()
npatients = Y_test.shape[0]
[[ 56   1]
 [  1 110]]

Confusion matrix = $ \begin{bmatrix} \textrm{Predicted positive and actual positive} & \textrm{Predicted positive and actual negative} \\ \textrm{Predicted negative and actual positive} & \textrm{Predicted negative and actual negative}\end{bmatrix} = \begin{bmatrix} \textrm{TP} & \textrm{FP} \\ \textrm{FN} & \textrm{TN} \end{bmatrix}$

TP: Predicted yes (malignant cancer), and the patients do have the disease. (How many malignant tumors are correctly identified? )

FP: Predicted yes (malignant cancer), but patients don't actually have the disease. (How many benign tumors are identified as malignant? Also known as a "Type I error.")

FN: Predicted no (benign cancer), but the patients actually have the disease. (How many malignant tumors are missed? Also known as a "Type II error.")

TN: Predicted no (benign cancer), and patients don't have the disease. (How mamy benign tumors are correctly identified?)

In [35]:
# Accuracy: Overall, how often is the classifier correct? We sum the diagonal of the confusion matrix!
(TP+TN)/npatients 
Out[35]:
0.9880952380952381

Accuracy is not always a reliable metric for the real performance of a classifier. Let's say there are 100 patients and 5 have malignant cancer. If the model classifies all the 100 patients as not having cancer, than the overall accuracy is 95% (TP = 0, TN = 95, FN = 5, FP = 0).

Let's take a look at other metrics:

True positive rate (TPR), or sensitivity or recall: When it's actually yes, how often does it predict yes? TPR = TP/(TP+FN)

True negative rate (TNR) or specificity: When it's actually no, how often does it predict no? TNR = TN/(FP+TN)

For the example above, the classifier has 100% specificity (predicts 100% of the benign cancers) and 0% sensitivity (fails to classify all cancerous tumors).

The "best" metric to use will depend on the classification problem. Back to our original dataset...

In [36]:
# True positive rate (TPR), or sensitivity or recall: When it's actually yes, how often does it predict yes?
TPR = TP/(TP+FN)
print(TPR)
0.9824561403508771
In [37]:
# True negative rate(TNR) or specificity:  When it's actually no, how often does it predict no?
TNR = TN/(FP+TN)   
print(TNR)
0.990990990990991
In [38]:
# Precision: When it predicts yes, how often is it correct?
p = TP/(TP+FP)
print(p)
0.9824561403508771

Checking other algorithms:

In [39]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

dict_classifiers = {
    "Logistic Regression": LogisticRegression(),
    "Nearest Neighbors": KNeighborsClassifier(),
    "Linear SVM": SVC(),
    "Gradient Boosting Classifier": GradientBoostingClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Naive Bayes": GaussianNB()
}
In [40]:
# Determine accuracy of classifier given data pairs (X, Y)
def accuracy(classifier, X, Y):
    Ypredict = classifier.predict(X)
    return (Y==Ypredict).mean()

def batch_classify(X_train, Y_train, X_test, Y_test, verbose = True):
    print("{:34s}{:15s}{:12s}{:12s}".format("Classifiers", "Test-score", "TPR", "TNR"))
    train_score = []
    test_score = []
    labels = []

    for classifier_name, classifier in list(dict_classifiers.items()):
        
        # Fit the model
        labels.append(classifier_name)
        classifier.fit(X_train, Y_train)
        # Make predictions on the test data
        predictions = classifier.predict(X_test)
        cmat = confusion_matrix(Y_test, predictions, labels=["M", "B"])
        TP, FP, FN, TN = cmat.ravel()
        # Get scores
        train_score.append(accuracy(classifier, X_train, Y_train))
        test_score.append(accuracy(classifier, X_test, Y_test))
        if verbose:
            print("{c:30}{t1:12f}{t2:12f}{t3:12f}".format(c=classifier_name,t1=test_score[-1], t2=TP/(TP+FN), t3=TN/(FP+TN) ), )
In [41]:
batch_classify(Z_train, Y_train, Z_test, Y_test, verbose = True)
Classifiers                       Test-score     TPR         TNR         
Logistic Regression               0.988095    0.982456    0.990991
Nearest Neighbors                 0.964286    0.947368    0.972973
Linear SVM                        0.970238    0.948276    0.981818
Gradient Boosting Classifier      0.970238    0.948276    0.981818
Random Forest                     0.934524    0.925926    0.938596
Naive Bayes                       0.922619    0.940000    0.915254
In [42]:
batch_classify(Xs_train, Y_train, Xs_test, Y_test, verbose = True)
Classifiers                       Test-score     TPR         TNR         
Logistic Regression               0.988095    0.982456    0.990991
Nearest Neighbors                 0.976190    0.981818    0.973451
Linear SVM                        0.964286    0.947368    0.972973
Gradient Boosting Classifier      0.958333    0.946429    0.964286
Random Forest                     0.970238    0.964286    0.973214
Naive Bayes                       0.934524    0.925926    0.938596